sample dataset
Data Distribution Valuation
Xu, Xinyi, Wang, Shuaiqi, Foo, Chuan-Sheng, Low, Bryan Kian Hsiang, Fanti, Giulia
Data valuation is a class of techniques for quantitatively assessing the value of data for applications like pricing in data marketplaces. Existing data valuation methods define a value for a discrete dataset. However, in many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled. For example, consider a buyer trying to evaluate whether to purchase data from different vendors. The buyer may observe (and compare) only a small preview sample from each vendor, to decide which vendor's data distribution is most useful to the buyer and purchase. The core question is how should we compare the values of data distributions from their samples? Under a Huber characterization of the data heterogeneity across vendors, we propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies for comparing data distributions from samples. We empirically demonstrate that our method is sample-efficient and effective in identifying valuable data distributions against several existing baselines, on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression).
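To make the idea of comparing distributions from small preview samples concrete, here is a minimal sketch of the standard biased (V-statistic) estimator of squared MMD with a Gaussian kernel. It is not the paper's valuation method: the reference sample, the bandwidth of 1.0, the synthetic Gaussian vendors, and the rule "smaller MMD to the reference is better" are all assumptions made for illustration.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def sample_gaussian(n, mean, rng):
    """Draw n two-dimensional points from an isotropic unit-variance Gaussian."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


rng = random.Random(0)
# Hypothetical setup: a reference sample plus small previews from two vendors.
reference = sample_gaussian(200, 0.0, rng)
vendor_a = sample_gaussian(50, 0.0, rng)   # same distribution as the reference
vendor_b = sample_gaussian(50, 1.5, rng)   # mean-shifted distribution

mmd_a = mmd_squared(vendor_a, reference)
mmd_b = mmd_squared(vendor_b, reference)
print(f"MMD^2 vendor A: {mmd_a:.4f}, vendor B: {mmd_b:.4f}")
```

Under these assumptions, the preview drawn from the same distribution as the reference yields the smaller estimate, so a buyer ranking vendors by this score would prefer vendor A.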
Efficient and Accurate Explanation Estimation with Distribution Compression
Baniecki, Hubert, Casalicchio, Giuseppe, Bischl, Bernd, Biecek, Przemyslaw
Exact computation of various machine learning explanations requires numerous model evaluations and in extreme cases becomes impractical. The computational cost of approximation increases with an ever-increasing size of data and model parameters. Many heuristics have been proposed to approximate post-hoc explanations efficiently. This paper shows that the standard i.i.d. sampling used in a broad spectrum of algorithms for explanation estimation leads to an approximation error worthy of improvement. To this end, we introduce Compress Then Explain (CTE), a new paradigm for more efficient and accurate explanation estimation. CTE uses distribution compression through kernel thinning to obtain a data sample that best approximates the marginal distribution. We show that CTE improves the estimation of removal-based local and global explanations with negligible computational overhead. It often achieves an on-par explanation approximation error using 2-3x fewer samples, i.e., requiring 2-3x fewer model evaluations. CTE is a simple, yet powerful, plug-in for any explanation method that now relies on i.i.d. sampling.
Utilizing Large Language Models to Identify Reddit Users Considering Vaping Cessation for Digital Interventions
Vuruma, Sai Krishna Revanth, Wu, Dezhi, Gupta, Saborny Sen, Aust, Lucas, Lookingbill, Valerie, Henry, Caleb, Ren, Yang, Kasson, Erin, Chen, Li-Shiun, Cavazos-Rehg, Patricia, Hu, Dian, Huang, Ming
The widespread adoption of social media platforms globally not only enhances users' connectivity and communication but also emerges as a vital channel for the dissemination of health-related information, thereby establishing social media data as an invaluable organic data resource for public health research. The surge in popularity of vaping or e-cigarette use in the United States and other countries has caused an outbreak of e-cigarette and vaping use-associated lung injury (EVALI), leading to hospitalizations and fatalities in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. Leveraging large language models, including both the latest GPT-4 and traditional BERT-based language models, for sentence-level quit-vaping intention prediction tasks, this study compares the outcomes of these models against human annotations. Notably, when compared to human evaluators, the GPT-4 model demonstrates superior consistency in adhering to annotation guidelines and processes, showcasing advanced capabilities to detect nuanced user quit-vaping intentions that human evaluators might overlook. These preliminary findings emphasize the potential of GPT-4 in enhancing the accuracy and reliability of social media data analysis, especially in identifying subtle user intentions that may elude human detection.
Direct Zernike Coefficient Prediction from Point Spread Functions and Extended Images using Deep Learning
Kok, Yong En, Bentley, Alexander, Parkes, Andrew, Wright, Amanda J., Somekh, Michael G., Pound, Michael
Optical imaging quality can be severely degraded by system and sample induced aberrations. Existing adaptive optics systems typically rely on iterative search algorithms to correct for aberrations and improve images. This study demonstrates the application of convolutional neural networks to characterise the optical aberration by directly predicting the Zernike coefficients from two to three phase-diverse optical images. We evaluated our network on 600,000 simulated Point Spread Function (PSF) datasets randomly generated within the range of -1 to 1 radians using the first 25 Zernike coefficients. The results show that using only three phase-diverse images captured above, below and at the focal plane with an amplitude of 1 achieves a low RMSE of 0.10 radians on the simulated PSF dataset. Furthermore, this approach directly predicts Zernike modes for simulated extended 2D samples, while maintaining a comparable RMSE of 0.15 radians. We demonstrate that this approach is effective using only a single prediction step, or can be iterated a small number of times. This simple and straightforward technique provides a rapid and accurate method for predicting the aberration correction using three or fewer phase-diverse images, paving the way for evaluation on real-world datasets.
Factoring Hate Speech: A New Annotation Framework to Study Hate Speech in Social Media
Ron, Gal, Levi, Effi, Oshri, Odelia, Shenhav, Shaul R.
Although this annotation scheme was designed to capture and characterize hate speech directed towards Jews, with the exception of one group-specific aspect, it is general enough to be applied to any other group-directed hate speech.
Social media has come to constitute a space for the propagation of hostility (see ElSherief et al., 2018, p. 1) and provides fertile grounds for the radicalization of individuals in support of violent extremist groups (Reynolds and Tuck, 2016; Mitts,
Understanding Unsupervised Machine Learning
In supervised machine learning, we have a labeled dataset that is used to train the model. For example, we train a model to predict the prices of houses based on features such as area, number of bedrooms, and location. In unsupervised machine learning, we do not have a labeled dataset. The goal of unsupervised machine learning is to find patterns and relationships in the data. Clustering is one of the most popular techniques used in unsupervised machine learning.
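As an illustration of clustering without labels, the sketch below implements a bare-bones k-means (Lloyd's algorithm) in plain Python. The article does not name a specific algorithm, so the choice of k-means, the toy points, and the function name `kmeans` are assumptions made for this example.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignments


# Unlabeled toy data: two visibly separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

No labels are supplied at any point; the grouping comes only from distances between the points, which is what distinguishes this from the supervised house-price example.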
Interactive Pipeline and Composite Estimators for Your End-to-End ML Model - Open Data Science - Your News Source for AI, Machine Learning & more
A data science model development pipeline involves various components, including data ingestion, data preprocessing, feature engineering, feature scaling, and modeling. A data scientist needs to write the learning and inference code for all of these components. For machine learning projects with heterogeneous data, the code structure can become messy and difficult for other team members to interpret. A pipeline is a handy construct that sequentially chains all your model development components. Using a pipeline, one can perform the learning and inference tasks in a comparatively cleaner code structure.
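The chaining idea can be sketched in a few lines of plain Python. This is a hypothetical, simplified stand-in and not the API of any particular library: the classes `MeanImputer`, `Standardizer`, `CentroidClassifier`, and `SimplePipeline`, as well as the toy data, are invented for illustration and only borrow the common fit/transform/predict naming convention.

```python
from statistics import mean, pstdev


class MeanImputer:
    """Replace missing values (None) with the column mean learned in fit."""
    def fit(self, rows):
        self.means_ = [mean(v for v in col if v is not None) for col in zip(*rows)]
        return self

    def transform(self, rows):
        return [[m if v is None else v for v, m in zip(row, self.means_)] for row in rows]


class Standardizer:
    """Scale each column to zero mean and unit variance."""
    def fit(self, rows):
        cols = list(zip(*rows))
        self.means_ = [mean(c) for c in cols]
        self.stds_ = [pstdev(c) or 1.0 for c in cols]  # avoid division by zero
        return self

    def transform(self, rows):
        return [[(v - m) / s for v, m, s in zip(row, self.means_, self.stds_)] for row in rows]


class CentroidClassifier:
    """Predict the label whose training centroid is closest."""
    def fit(self, rows, labels):
        self.centroids_ = {}
        for label in set(labels):
            members = [r for r, lab in zip(rows, labels) if lab == label]
            self.centroids_[label] = [mean(c) for c in zip(*members)]
        return self

    def predict(self, rows):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return [min(self.centroids_, key=lambda lab: sq_dist(row, self.centroids_[lab]))
                for row in rows]


class SimplePipeline:
    """Run the transformers in order, then the final model."""
    def __init__(self, transformers, model):
        self.transformers = transformers
        self.model = model

    def fit(self, rows, labels):
        for step in self.transformers:
            rows = step.fit(rows).transform(rows)  # learn, then apply, each step
        self.model.fit(rows, labels)
        return self

    def predict(self, rows):
        for step in self.transformers:
            rows = step.transform(rows)  # reuse statistics learned during fit
        return self.model.predict(rows)


train_rows = [[1.0, 200.0], [2.0, None], [1.5, 220.0],
              [8.0, 900.0], [9.0, 950.0], [None, 870.0]]
train_labels = ["small", "small", "small", "large", "large", "large"]

pipeline = SimplePipeline([MeanImputer(), Standardizer()], CentroidClassifier())
pipeline.fit(train_rows, train_labels)
predictions = pipeline.predict([[1.2, 210.0], [8.5, None]])
print(predictions)
```

The point of the design is that learning and inference each become a single call, and the preprocessing statistics learned during `fit` are reused automatically at prediction time.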
Posted by ODSC Community, November 3, 2022.